0: What is this tutorial? What’s ggplot?

ggplot (the ggplot2 package) is a package in R that allows for highly customizable and pretty plots! Here, we’re going to learn a few basics of making plots using ggplot that will hopefully get you well on your way to making informative and beautiful data visualizations!

1: First Install/load ggplot2

# Load the package, installing it first if it is missing
if (!require(ggplot2)) {
  install.packages("ggplot2") # the package name must be quoted
  library(ggplot2)
}
## Loading required package: ggplot2
library(tidyverse)
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

2: Load in Sample Data (NHANES)

We’re going to practice here on a dataset from the 1990 NHANES (National Health and Nutrition Examination Survey). The variables are below.

Region - Geographic region in the USA: Northeast (1), Midwest (2), South (3), and West (4)

Sex - Biological sex: Male (1), Female (2)

Age - Age measured in months (we’ll convert this to years below)

Urban - Residential population density: Metropolitan Area (1), Other (2)

Weight - Weight in pounds

Height - Height in inches

BMI - BMI, measured in kg/(m^2)

nhanes <- read.csv("NHANES1990.csv", stringsAsFactors = FALSE)
nhanes$Age <- nhanes$Age/12 # convert age from months to years for convenience

# Recoding the numeric codes into readable labels
nhanes$Urban <- dplyr::recode(nhanes$Urban, '1' = 'Metro Area', '2' = 'Non-Metro Area')
nhanes$Region <- dplyr::recode(nhanes$Region, '1' = 'Northeast', '2' = 'Midwest', '3' = 'South', '4' = 'West')
head(nhanes)
##    Region Sex      Age          Urban Weight Height  BMI
## 1   South   2 42.75000 Non-Metro Area  171.7   65.3 28.4
## 2    West   1 25.58333 Non-Metro Area  155.2   62.3 28.2
## 3    West   2 73.83333     Metro Area  166.7   59.2 33.5
## 4    West   1 38.16667     Metro Area  224.7   71.9 30.6
## 5 Midwest   1 74.00000 Non-Metro Area  245.0   67.7 37.6
## 6    West   2  2.75000     Metro Area   28.3   35.2 16.0

3: Scatter Plot & Basic ggplot Syntax

-Plots are built with the ggplot() command, and the result can be saved to a variable, as in a <- ggplot(...)

-The general format is ggplot(data, aes(x = [x axis variable], y = [y axis variable])). The x and y variables are specified inside the aes() subfunction

Run this line of code

ggplot(nhanes, aes(x = Age, y = Weight))

  • The axes are set up the way we’d expect, and seem to have sensible values
  • Why is nothing on this graph yet? We haven’t told ggplot what kind of graphic to draw on the axes

We need to tell the ggplot() call what kind of graphic to put on the axes

  • A lot of the time, the syntax is geom_[something]

So let’s make a scatter plot with geom_point() first:

ggplot(nhanes, aes(x = Age, y = Weight)) + geom_point()

Wow, lots of data points! Maybe we can make the points smaller (size) and more transparent (alpha) to see them better

ageWeightPlot <- ggplot(nhanes, aes(x = Age, y = Weight)) + 
  geom_point(size = .1, alpha = .3)

ageWeightPlot
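As a small illustration of the layering idea above, here is a sketch on made-up data. The toy data frame and its simulated values are assumptions for illustration only (they are not the NHANES data); the ggplot calls are the same ones used in this section.

```r
library(ggplot2)

# Hypothetical toy data standing in for nhanes (values are simulated)
set.seed(1)
toy <- data.frame(Age = runif(200, min = 0, max = 80))
toy$Weight <- 60 + 1.5 * toy$Age + rnorm(200, sd = 20)

# Axes only: no geom has been added, so the plot has zero layers
basePlot <- ggplot(toy, aes(x = Age, y = Weight))
length(basePlot$layers)

# Adding geom_point() with + returns a new plot with one layer;
# basePlot itself is left unchanged
pointPlot <- basePlot + geom_point(size = .5, alpha = .3)
length(pointPlot$layers)

pointPlot
```

Because basePlot is saved as an object, it can be reused with different geoms without retyping the data and aes() parts.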

Remember, it’s a good habit to save your plots to objects, not just draw them! Here we save a version where the color and shape of each point depend on Region:

ageWeightPlot <- ggplot(nhanes, aes(x = Age, y = Weight)) + 
  geom_point(size = 1, alpha = .2, aes(color = factor(Region), shape = factor(Region)))

ageWeightPlot

4: Dot Plot Organized by Grouping Factor

Question: what if we want to look at the distribution of heights by urban status in the nhanes data?

With ggplot, if x is a factor (discrete, not continuous), we can plot dots as a function of the factor

heightByUrban <- ggplot(nhanes, aes(x = Urban, y = Height)) + 
  geom_point()

heightByUrban

Whoa! With so much data, it’s hard to see anything. Let’s use geom_jitter() with size = .05 and width = .1 instead

heightByUrban <- ggplot(nhanes, aes(x = Urban, y = Height)) + geom_jitter(size = .05, width = .1)
heightByUrban

4a: Visualizing Distributions with Violin Plots

Still a TON of data! Here’s a great tool I really like for plotting the density of distributions of points like this!

# We can use 'source' to pull bits of code from github, as well as local files
source("https://gist.githubusercontent.com/benmarwick/2a1bb0133ff568cbe28d/raw/fb53bd97121f7f9ce947837ef1a4c65a73bffb3f/geom_flat_violin.R") 

heightByUrban + geom_flat_violin(position = position_nudge(x = .15, y = 0), alpha = .7)

Now, we can really begin to see the skewness of these height distributions.

5: Summary Plots

Plotting data points is all well and good, but what if we want to use our plots to summarize distributions? We’ll do that here:

# In ggplot2 3.3.0 and later the argument is fun; older versions called it fun.y
weightByRegion <- ggplot(nhanes, aes(x = Region, y = Weight)) + stat_summary(fun = "mean", geom = "point")
weightByRegion

# Add a bootstrapped 95% confidence interval around each mean
# (mean_cl_boot needs the Hmisc package to be installed)
weightByRegionConf <- ggplot(nhanes, aes(x = Region, y = Weight)) + stat_summary(fun.data = "mean_cl_boot", fun.args = list(conf.int = .95))

weightByRegionConf
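To make it clearer what stat_summary() is plotting, here is a sketch on made-up data that compares the plotted values with group means computed by hand. The toy data frame is an assumption for illustration. Note that this sketch swaps in mean_se() (a normal-approximation interval of the mean plus or minus 1.96 standard errors) rather than the bootstrap used above, so that it runs without the Hmisc package.

```r
library(ggplot2)

# Hypothetical toy data (simulated weights, four regions)
set.seed(2)
toy <- data.frame(
  Region = rep(c("Northeast", "Midwest", "South", "West"), each = 50),
  Weight = rnorm(200, mean = 160, sd = 30)
)

# Group means computed by hand, for comparison with the plot
regionMeans <- aggregate(Weight ~ Region, data = toy, FUN = mean)
regionMeans

# mean_se() returns the mean plus/minus mult standard errors;
# mult = 1.96 gives an approximate 95% interval
toySummary <- ggplot(toy, aes(x = Region, y = Weight)) +
  stat_summary(fun.data = mean_se, fun.args = list(mult = 1.96),
               geom = "pointrange")
toySummary

# layer_data() shows the values the plot actually draws;
# the y column holds the same group means
plottedMeans <- layer_data(toySummary)$y
plottedMeans
```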

6: Fitting Lines To The Data

Let’s say we think there might be a linear relationship between height and weight.

We can use geom_smooth() (or the equivalent stat_smooth()) for this, with method = 'lm' specifically for a linear model. Also, level = .95 specifies the confidence interval about the estimate at each x value

heightByWeight <- ggplot(nhanes, aes(x = Height, y = Weight)) + geom_point() + stat_smooth(method = 'lm')
heightByWeight

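Hmm, this actually looks like it’s giving us some pretty bad predictions. We’re not going to get into the stats of this now, but we can also use method = 'auto', which picks a smoothing method based on the size of the data (here it chooses a GAM, as the message below shows) and might be a bit smarter. We also widen the confidence band with level = .99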

heightByWeight <- ggplot(nhanes, aes(x = Height, y = Weight)) + geom_point() + stat_smooth(method = 'auto', level = .99)
heightByWeight
## `geom_smooth()` using method = 'gam'
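To see what method = 'lm' is doing, here is a sketch on made-up data that fits the same line with lm() and checks it against what the plot draws. The toy data frame and the object names are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data (simulated heights and weights)
set.seed(3)
toy <- data.frame(Height = rnorm(150, mean = 65, sd = 4))
toy$Weight <- -200 + 5.5 * toy$Height + rnorm(150, sd = 25)

# Ordinary least-squares fit of Weight on Height
fit <- lm(Weight ~ Height, data = toy)
coef(fit)

toySmooth <- ggplot(toy, aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = .95)
toySmooth

# Layer 2 is the smooth; its y values match predictions from fit
smoothData <- layer_data(toySmooth, 2)
lmPredictions <- predict(fit, newdata = data.frame(Height = smoothData$x))
all.equal(smoothData$y, unname(lmPredictions))
```

The ymin and ymax columns of smoothData hold the confidence band that level controls.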

7: Titles, Axis Labels

The labs() command can be added to a ggplot with different arguments, like x, y, or title, to make the plots clearer

heightByUrban + labs(x = 'Urban or Not', y = 'Height in Inches', title = 'Height by Urban Status')

Note, if we want to change labels for FACTORS, not just axes, it’s easier to do that using the tidyverse

weightByRegionLabeled <- ggplot(nhanes, aes(x = Region, y = Weight)) + geom_point() + 
  labs(x = 'Region of US', y = 'Weight (lbs)', title = 'Weight by Region')
weightByRegionLabeled
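As an illustration of relabeling factor levels (rather than axes) with the tidyverse, here is a sketch that recodes the Sex codes from the variable list, Male (1) and Female (2), on a tiny made-up data frame. The toy data and its height values are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data using the Sex coding from the variable list:
# Male (1), Female (2)
toy <- data.frame(
  Sex = c(1, 2, 2, 1, 2, 1),
  Height = c(70.1, 64.2, 62.8, 68.5, 66.0, 71.3)
)

# Replace the numeric codes with readable labels
toy$Sex <- recode(toy$Sex, `1` = "Male", `2` = "Female")
toy$Sex

# The x axis now shows the labels instead of 1 and 2
ggplot(toy, aes(x = Sex, y = Height)) +
  geom_point() +
  labs(x = "Sex", y = "Height (in)", title = "Height by Sex")
```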

7a: Faceting

It’s sometimes useful to show several plots as panels in one figure, not just a single plot.

So for this data set, say we want to plot relationships between height and weight, but by region.

We can do this with facet_wrap('Region')

facetPlot <- ggplot(nhanes, aes(x = Height, y = Weight)) + geom_point() + geom_smooth(method = 'lm') + facet_wrap('Region')
facetPlot

We can even facet by multiple factors. Setting scales = 'free_y' lets each panel have its own y axis range

multiFacet <- ggplot(nhanes, aes(x = Height, y = Weight)) + geom_point() + geom_smooth(method = 'lm') + facet_wrap(c('Region', 'Urban'), scales = 'free_y')
multiFacet
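The faceting calls above can be sketched on made-up data, using the formula notation (~ Region) that facet_wrap() also accepts. The toy data frame is an assumption for illustration; counting the panels confirms that one factor with four levels gives four panels, and two factors give one panel per combination.

```r
library(ggplot2)

# Hypothetical toy data: 4 regions x 2 urban categories x 20 people
set.seed(4)
toy <- expand.grid(
  Region = c("Northeast", "Midwest", "South", "West"),
  Urban = c("Metro Area", "Non-Metro Area"),
  rep = 1:20
)
toy$Height <- rnorm(nrow(toy), mean = 65, sd = 4)
toy$Weight <- -200 + 5.5 * toy$Height + rnorm(nrow(toy), sd = 25)

basePlot <- ggplot(toy, aes(x = Height, y = Weight)) + geom_point()

# One faceting factor, then two
oneFactor <- basePlot + facet_wrap(~ Region)
twoFactors <- basePlot + facet_wrap(~ Region + Urban, scales = "free_y")

# Count the panels in each plot
nOne <- length(unique(layer_data(oneFactor)$PANEL))
nTwo <- length(unique(layer_data(twoFactors)$PANEL))
c(nOne, nTwo)

twoFactors
```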

8: Color

We can color either continuously or discretely, depending on how a variable is represented in R

We put col = Height inside the aes() function because the color should depend on a variable in the data. Height is numeric, so we get a continuous color scale

Continuous Example

weightByRegion <- ggplot(nhanes, aes(x = Region, y = Weight, col = Height)) + geom_jitter(width = .1) + 
  labs(x = 'Region of US', y = 'Weight (lbs)', title = 'Weight by Region')
weightByRegion

Discrete Example

discretePlot <- ggplot(nhanes, aes(x = Height, y = Weight, col = Region)) + geom_point() 
discretePlot

It is EASY to choose your own custom colors (as well as using R presets), but we’re not going to get into that right at this moment

9: Themes

We can use themes to make our plots prettier, and also customize the gridlines a lot

discretePlot + theme_bw()

discretePlot + theme_minimal()

discretePlot + theme_void()

discretePlot + theme_classic()

facetPlot + theme_bw() + labs(title = 'Weight by Height and Region')
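For the gridline customization mentioned above, here is a minimal sketch using theme() on top of a preset theme. The toy data and the particular gridline choices are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(Height = c(60, 64, 68, 72),
                  Weight = c(120, 150, 170, 200))

toyPlot <- ggplot(toy, aes(x = Height, y = Weight)) + geom_point()

# Start from theme_bw(), then drop the minor gridlines and
# lighten the major ones. theme() must come after theme_bw(),
# otherwise the preset would overwrite these settings.
cleanPlot <- toyPlot +
  theme_bw() +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "grey90")
  )
cleanPlot
```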

10: Saving To Files

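ggsave() takes arguments for the file name, plot, dpi, width, and height (width and height are in inches by default)

We can use a variety of file formats. The format is chosen from the file extension, for example .pdf or .png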

ggsave('newPlotTest.pdf', plot = facetPlot, dpi = 300, width = 5, height = 5)
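Here is a self-contained sketch of saving a plot and checking that the file was written. The toy data and the temporary file path are assumptions for illustration; in practice you would use your own plot object and file name.

```r
library(ggplot2)

# Hypothetical toy data and plot
toy <- data.frame(Height = c(60, 64, 68, 72),
                  Weight = c(120, 150, 170, 200))
toyPlot <- ggplot(toy, aes(x = Height, y = Weight)) + geom_point()

# Write to a temporary folder so the working directory stays clean
outFile <- file.path(tempdir(), "toyPlot.pdf")

# The .pdf extension selects the PDF format; dpi only matters
# for raster formats such as .png, so it is left out here
ggsave(outFile, plot = toyPlot, width = 5, height = 5)

file.exists(outFile)
```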

11: Final Points

Basic ggplot will get you a LONG way.

Also, there is much more ggplot can do for making your plots very pretty, and also for plotting lots of complex models

Unlike Excel and SPSS, which can often be cranky and difficult to bend to your will in customizing plots, ggplot is really easy to work with to make your graph look the way you want